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Abstract 

A common metaphor for describing development is a rugged "epigenetic landscape" where cell fates are represented as 
attracting valleys resulting from a complex regulatory network. Here, we introduce a framework for explicitly constructing 
epigenetic landscapes that combines genomic data with techniques from spin-glass physics. Each cell fate is a dynamic 
attractor, yet cells can change fate in response to external signals. Our model suggests that partially reprogrammed cells are 
a natural consequence of high-dimensional landscapes, and predicts that partially reprogrammed cells should be hybrids 
that co-express genes from multiple cell fates. We verify this prediction by reanalyzing existing datasets. Our model 
reproduces known reprogramming protocols and identifies candidate transcription factors for reprogramming to novel cell 
fates, suggesting epigenetic landscapes are a powerful paradigm for understanding cellular identity. 
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Introduction 

Understanding the molecular basis of cellular identity and 
differentiation is a major goal of modern biology. This is especially 
true in light of the work of Takahashi and Yamanaka demon- 
strating that the overexpression of just four transcription factors 
(TFs) is sufficient to convert somatic fibroblasts into cells 
resembling embryonic stem cells (ESCs), dubbed induced plurip- 
otent stem cells (iPSGs) [1]. The idea of using a small set of TFs to 
reprogram cell fate has proven to be extremely versatile and 
reprogramming protocols now exist for generating neurons [2], 
cardiomyocytes [3] , liver cells [4,5] , neural progenitor cells (NPC) 
[6], and thyroid [7] (see reviews [8,9] for more details). Despite 
these revolutionary experimental advances, cell fate is still poorly 
understood mechanistically and theoretically. Recent experiments 
suggest cell fates can be viewed as high-dimensional attractor states 
of the gene regulatory networks underlying cellular identity [10]. 
In particular, cell fates are characterized by a robust gene 
expression and epigenetic state resulting from the complex 
interplay of transcriptional regulation, chromatin regulators, 
non-coding and microRNAs, and signal transduction pathways. 

These experiments have renewed interest in the idea of an 
'epigenetic landscape' that underlies cellular identity [1 1-15]. The 
landscape picture requires several key features to be consistent 
with experimental observations (see Figure 1). All cell fates must be 
robust attractors, yet allow cells to change fate through rare 
stochastic transitions [8,16] as in cellular reprogramming 



experiments (Figure lA). A common result of reprogramming is 
not the desired cell fate, but partially reprogrammed cells [17,18]. 
These results suggest that the landscape is rugged and may contain 
additional spurious attractors corresponding to cell fates that do 
not naturally occur in vivo. In addition, environmental and 
external signals can control cell fates. Some environments stabilize 
particular cell fates (Figure IB). A dramatic example of this is a 
protocol for reprogramming to neural progenitor cells (NPCs) that 
is identical to Yamanaka's protocol for reprogramming to ESC 
except for the culturing media [19]. Other external signals 
deterministically switch cell fates, as occurs in normal development 
(Figure 1 C) [20] . Together, these imply the landscape is a dynamic 
entity that depends on environmental signals. 

The recent experimental progress has inspired several different 
theoretical approaches to understand the epigenetic landscape and 
the underlying gene regulatory networks governing cell fates. One 
focus has been on explicit construction of landscapes for specific 
cell fate decisions such as the erythroid vs myeloid choice in 
hemopoietic development [21], pancreatic cell fates [22], or C. 
elegans vulva development [23]. Other network based approaches 
use experimental data to constrain the possible networks [24,25]. 
A second area of work is based on understanding the underlying 
gene regulatory network [26,27]. A recent paper [28] attempts to 
combine the network and landscape picture by using the network 
entropy to define a landscape. On a more abstract level, there has 
been a renewed interest in understanding Waddington's land- 
scape mathematically using ideas from dynamical systems and 
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Author Summary 

Traditionally, standard development has been viewed as a 
one-way process; an organism starts as a single cell 
(embryonic stem cell, ESC) that divides into a multitude of 
mature cell types (skin cells, heart, liver, etc). But, in 2006 
Takahashi and Yamanaka revolutionized this view by 
stochastically converting skin cells into cell types resem- 
bling ESC (called induced pluripotent stem cells, iPSC). 
Following this groundbreaking experiment, other repro- 
gramming protocols have been found, so now scientists 
can switch between a variety of cell types such as ESC, 
skin, liver, neurons, and heart. This has already revolution- 
ized the understanding of biology and could change the 
future of medicine. A common metaphor for development 
is Waddington's landscape, in which an ESC is like a ball 
rolling down a hill which eventually ends in a valley 
(mature cell type). In this paper, we make this analogy 
more precise by developing a mathematical model of 
cellular development. Using data on real cell types, we can 
provide insight into existing reprogramming protocols and 
potentially predict new reprogramming protocols. 



reprogramming protocols to novel cell fates. Taken together, these 
results suggest that epigenetic landscapes represent a powerful 
framework for understanding the molecular circuitry and dynam- 
ics that gives rise to cell fate. 

The organization of the paper is as follows. First, we explain the 
motivation for using an attractor neural network to model the 
epigenetic landscape. Second, we define the state space for the 
model and the actual biological data used to construct the state 
space. Third, we give an overview of our landscape model (with 
details given in Table 1 and Materials and Methods: Landscape 
Model). Next, we show that our mathematical model captures the 
essential experimental features of cellular identity. We then show 
that our model naturally explains the existence of partially 
reprogrammed cells and makes predictions about their gene 
expression profiles. We verify this by reanalyzing experimental 
data. Finally, we show that our model can identify key 
reprogramming genes in existing reprogramming protocols, 
suggesting it can be used to identify candidate TF for reprogram- 
ming to novel cell fates. We conclude by discussing the 
implications of our mathematical model for understanding cellular 
identity and reprogramming. 



nonequilibrium statistical mechanics [15,29]. Most of these models 
focus on in vivo developmental decisions and hence consider the 
dynamics of a few genes or proteins. 

Here, we present a new modeling framework to construct a 
global (i.e. all cell fates and all TFs) epigenetic landscape that 
combines techniques from spin glass physics with whole genome 
expression profiles. We were inspired by the successful application 
of spin glasses to model neural networks [30-33] and protein 
folding landscapes [34]. Here, we construct an epigenetic 
landscape model for cellular identity with 63 stable cell fates and 
1337 TFs using cell-fate specific, mouse microarray gene 
expression data. Each cell fate is a robust attractor, yet cells can 
deterministically switch fates in response to external signals. Our 
model provides a unified framework to discuss differentiation and 
reprogramming. It also naturally explains the existence of partially 
reprogrammed cell fates as 'spurious' attractors resulting from the 
high dimensionality of the landscape. Our model predicts, and we 
verify, that partially reprogrammed cells are hybrids that co- 
express TFs of multiple naturally occurring cell fates. Finally, our 
model reproduces known reprogramming protocols to iPSCs, 
heart, liver, NPC, and thyroid, and has the potential for designing 



Results 

Motivation fronn attractor neural networks 

The Takahashi and Yamanaka reprogramming experiments [1] 
are reminiscent of content-addressable memory and attractor 
neural networks. First, let us introduce a content-addressable 
memory with a paraphrasing of the original Hopfield paper. A 
content-addressable memory allows one to retrieve a full memory 
based on sufficient partial information. For example, suppose the 
complete stored memory is "John J. Hopfield, Neural networks 
and physical systems with emergent collective computational 
abilities (1982)." A content-addressable memory is capable of 
retrieving the full memory based on partial, incomplete input. 
Therefore, the details "Hopfield," "Neural networks," and "1982" 
could be enough to recall the full memory. 

In the Yamanaka reprogramming protocol, overexpressing 
only four TFs is enough for a fibroblast to "recall" the global TF 
expression of an ESC. A content-addressable memory is 
naturally represented as a basin of attraction in a dynamical 
system, with partial recall corresponding to entering the basin of 
attraction and full recall corresponding to reaching the 
minimum of the basin. Hopfield attractor neural networks 



A. Minimal Landscape B. Cell Stabilized C. Cell Switch due 

by Environment to External Signal 




Figure 1. Phenotypic landscape. These are illustrative cartoons of the cell fate attractor landscape. (A) The minimal cellular identity landscape. 
Each cell fate is a basin of attraction (black circles). Reprogramming between different cell fates (1 and 2) can occur probabilistically via different 
trajectories (black paths). Partially reprogrammed cells (PRC) exist as smaller, spurious, basins of attraction (red circle) that can be experimentally 
observed by reprogramming experiments (example trajectory in red). (B) Same cellular identity landscape in the presence of a stabilizing 
environment (ex. favorable culturing medium) for cell fate 2. The environment increases the radius and depth of the cell fate 2 basin of attraction. (C) 
Landscape in the presence of an external signal that gives rise to differentiation from cell fate 1 to cell fate 2 (ex. growth factors associated with 
differentiation). Notice the low energy path between the cell fates that drives switching from cell fate 1 to cell fate 2. 
doi:1 0.1 371 /journal.pcbi.1 003734.g001 
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Table 1. Mathematical model of cell identity landscape. 





Landscape Term: Index Notation 


Landscape Term: Matrix Notation (dim.) 


Biological Interpretation 


H = Hbasin + Hhias + ^culture + H switch 




Total landscape. 


1 N N 

^, . - _iy" c.r.v. 


2 


Produces cell basins of attraction. 


N 

Hbias = — ^ BiSi 
i=\ 


Hbias = -B^S 


External control of individual genes, i.e. inducible 
expression. 


H,uiture = -NY,¥'a^ 




External control of specific cell basins, i.e. 
culturing conditions. 


^ M=l v=l 


N J 
^switch = — ym Ga 


Cell switching by signals, i.e. in vivo development. 


N 




Number of TPs, labeled by In this paper 
7V=1337. 


p 




Number of cell fates, labeled by ^.i, v. In this paper 

p = 63. 


Si 


S (A^xl) 


State (±1) ofi"' TF. 






State (± 1) of f'' TF in cell fate pi. 


1 ^ 

^ - TV 


A=l-^^^ ipxp) 
N 


Correlation between cell fate p and v. 


1 p p 

li=\ v=l 


J=\-A^A-'^ {NxN) 
N 


Interaction strength between / and j. 


Bi 


B (A^xl) 


External control of TF. 




hipxl) 


External control of p'^^ cell fate. 




m=^^S ipxl) 


Overlap of S on cell fate p. 


/- N 

a''=j2(A-'rm'=J2niSi 


a = A~^m = riS ipxl) 


Projection of S on cell fate p. 


^ v=l 


r]=^A-'^{pxN) 


Predictivity of i'^ TF in cell fate p. 




Gipxp) 


Signal dependent coupling that drives cell fate v 
to cell fate p 



This table provides a summary of the landscape model and the biological interpretation of each term. The first column is written in index notation, while the second 
column is the same term in matrix notation with the dimension of the term given in parenthesis. If no dimension is listed, the term is a single number. 
doi:1 0.1 371/journal.pcbi.1 003734.t001 



[30,31,33] are a general method to take an input set of vectors 
("memories") and explicitly construct a unique, global, land- 
scape such that each input vector is a global minimum and has a 
basin of attraction. In what follows, we will exploit the analogy 
between associative memory in attractor neural networks and 
cellular reprogramming to explicitly construct the epigenetic 
landscape underlying cellular identity. 

The epigenetic landscape 

Our goal is to model the global epigenetic landscape 
involving all cell fates by using genome wide data. Currently, 
microarrays are the only technology with genome wide data for 
a multitude of cell fates (although RNA-seq and other 
technologies will likely be useful in the future). Specifically, 
we compiled a dataset of 601 mouse whole genome micro- 
arrays (details in Materials and Methods: Data Analysis) 
resulting in the gene expression for TV = 1337 transcription 
factors for p = 63 cell fates. We restricted our considerations to 
TFs due to their importance in cellular reprogramming and 
differentiation. However, our model can be easily generalized 



to include other important genes. To robustly compare 
microarrays from multiple platforms, we converted the raw 
expression data into a rank ordered list. We assumed that gene 
expression is log-normal distributed (the minimal-assumption 
model for positive-definite random numbers such as gene 
expression) and assigned a z-score to each TF. The final output 
of this procedure is that it assigns each TF in every cell fate a z- 
score gene expression. 

This continuous gene expression could be used to construct 
our epigenetic landscapes. However, for mathematical conve- 
nience, we discretize the continuous gene expression data into 
high expression (+ 1 for z-score > =0) and low expression (— 1 
for z-score <0). See Text SI for an extended discussion on 
continuous vs discrete TF expression in attractor neural 
networks. 

This discretization process is biologically plausible. Cellular 
identity and differentiation are largely controlled by epigenetics, 
especially histone modifications (HMs) [35] (Figure 2A). Epige- 
netics primarily controls the accessibility of DNA and depending 
on the HM, the DNA can be stabilized in an open or closed 
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Figure 2. Overview of model. (A) Histone 3 tri-methylation at lysine 4 (K4) is associated with active genes, while histone 3 tri-methylation at lysine 
27 (K27) is associated with repressed genes. (B) Conditional probability distribution of histone modification (HM) given transcription factor (TF) 
expression levels derived by comparing microarray data with HM data from [36,37]. Notice the sharp threshold (black line) between expression levels 
of active and inactive TFs. (C) For mathematical convenience, we take the continuous TF expression levels and convert it to binary states (z-score 
> = 0 to +1 and z-score <0 to — 1). This binarization is consistent with the result from (B). (D) An arbitrary state is represented by a vector S of + 1, 
with each dimension in the vector space representing the state of a TF. The natural cell fates form a subspace (gray plane). The landscape model is 
based on the orthogonal projection of the TF state onto this subspace. (E) The dynamics of the landscape model for different initial conditions for a 
fully connected interaction matrix Jy and a diluted (non-equilibrium) interaction matrix where 20% of interactions have been randomly deleted. Plot 
shows the projection of 5* on embryonic stem cells (ESC) as function of time. Notice the large basins of attraction (red bracket). Parameters used were 
P = 2.2 and burst errors of 2% every 5000 spin updates. (F) Simulations showing how a common myeloid progenitor (CMP) can differentiate into 
either granulo-monocytic progenitors (GMP) or megakaryocyte-erythroid progenitors (MEP) in response to two distinct external signals. All 
trajectories used ^5 = 2.2. For signal 1, we set G^^^^^^^ = 0.5 and all other G^^' = 0. For signal 2, we set g^^^'^^^ = 0.5 and all other G^" = 0. 
doi:1 0.1 371 /journal.pcbi.1 003734.g002 



configuration. Using global HM data [36,37] and comparing it to 
microarray data, we created a conditional probability distribution 
of having a HM given a TF expression level (Figure 2B). We find 
that between a z-score of —0.5 to 0.5 there is a sharp threshold 
which distinguishes genes with the activating modification of 
histone 3 tri-methylation at lysine 4 (K4) from genes with the 
inactivating modification of histone 3 tri-methylation at lysine 27 
(K27) and poised/bivalent genes (both K4 and K27). This 
provides a potential biological justification to our discretization. 
In summary, we take the continuous gene expression and binarize 
(Figure 2C). These binary (i.e. on/off) TF data are the only 
biological input into our model. 

In order to precisely describe the landscape results, we need to define 
the correct way to measure distances. One possible measure is the 
overlap (aka dot product or magnetization), defined for cell fate fi as: 

i=i 

where Sf is an arbitrary expression state and is the gene expression 



in the natural cell fate /i. The overlap between cell fate ji and state *S/ 
for exactly correlated, anti-correlated, or uncorrelated states is 1 , — 1 , 
or O respectively. 

Cell fates from similar lineages (ex. blood) often have similar gene 
expression patterns. For example, B cells and T cells have a 77% 
overlap in their gene expression profiles. Such large correlations 
between cell fates makes the overlap, m, a poor distance measure. In 
order to measure distances between highly correlated vectors, it is 
helpful to define the "projection" of a gene expression state St on a 
cell fate fi by 

p 

a^=^(^-ifW (2) 

v=l 

where is the inverse correlation matrix and nf is the overlap on 
cell fate v and is given by 
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The projection measures the orthogonal projection of a state 
Si onto the subspace spanned by naturally occurring cell fates, ^ 
(see Figure 2D and Text SI), and a perfect projection onto state jii 
is given by = 1 . In contrast with the overlap, B cells have zero 
projection on T cells, and vice versa. 

Our landscape assigns an "energy" to every global expression 
state. We emphasize that this energy does not correspond to 
physical energy consumption of ATP; instead it is an abstract 
energy that corresponds to stability and developmental potential of 
cell fates. The complete landscape H can be thought of as arising 
from four terms with a simple interpretation (see Figure 1): 

^ ~ Hbasin "l~ ^^bias "l~ ^culture "l~ ^switch (4) 

The first term, Hhasin, ensures that observed cell fates are valleys 
in our landscape (Figure lA). The second term, Hhias, describes 
biasing of specific TFs by experimentalists (not shown in Figure 1). 
The third term, H culture^ increases the radius and depth of cell fates 
that are favored by the environment or culturing conditions 
(Figure IB). Finally, in the presence of an external signal that gives 
rise to differentiation (ex. growth factors associated with differen- 
tiation), the fourth term, H switch^ opens a low energy path between 
the initial and final cell fates (Figure IC). We give a complete 
mathematical description of the model in the Materials and 
Methods: Landscape Model and a summary in Table 1. 

Cell fates are dynamic attractors that are responsive to 
signals 

We performed self-consistency checks for our model using two 
in silico experiments (see details in Materials and Methods: 
Simulations). To verify that naturally occurring cell fates are 
dynamic attractors, we randomly perturbed the gene expression 
profile of cells from the ESC state and then tracked the gene 
expression over time. Real biology has many potential sources of 
noise, and the asynchronous dynamics introduced above will 
likely underestimate the noise. To show that our model is still 
robust to other large sources of noise, in our simulations we also 
add in periodic bursts of noise by flipping a fixed percentage of 
TF states (2%) to mimic the observation that cellular divisions 
produce HM errors [38]. Figure 2E shows the projection of the 
TF state on the ESC state as a function of time. For a large 
number of starting conditions, after an initial transient, the system 
relaxes back to the ESC state (red bracket), explicitly demon- 
strating the existence of a large basin of attraction [10]. This is 
true even when we break detailed balance by making the 
interaction matrix asymmetric by randomly deleting 20% of 
interactions (Figure 2E Diluted). 

Our model can also deterministically switch between cell fates in 
response to differentiation signals. For example, the common 
myeloid progenitor (CMP) is a blood cell fate that in vivo can 
differentiate into either granulo-monocytic progenitors (GMP) or 
megakaryocyte-erythroid progenitors (MEP). In Figure 2F, we 
show in silico validation where we start the system in the CMP 
state and show the trajectories after applying either the GMP 
(signal 1, blue) or MEP (signal 2, red) differentiation signal, 
resulting in branching to two distinct cell fates. 

Partially reprogrammed cells as "spurious" attractors 

When performing a reprogramming experiment, besides the 
initial cell fate and the end goal cell fate, experimentalists often 
produce "novel cell fates", dubbed partially reprogrammed cells 
[17,18]. These partially reprogrammed cells have the character- 



istics of a stable cell fate (i.e. they can be passaged indefinitely in 
culture), but may express a mix of key markers for multiple cell 
fates and have a global gene expression that does not match any in 
vivo cell fate [18]. 

While the existence of partially reprogrammed cells was 
surprising to experimentalists, they have a natural interpretation 
in our model. One of the most generic properties of all attractor 
neural networks is that in addition to the desired attractors, the 
non-linearity of the dynamical process and topology of high- 
dimensional (in our case TV = 1337) vector spaces induces 
additional attractors, which are termed spurious attractors [33]. 
In our model, since the natural cell fates are the input vectors, 
these spurious attractors can be interpreted as potential cell fates 
that do not occur in vivo. These spurious attractors are predicted 
to be low-dimensional combinations, or hybrids (see Materials and 
Methods: Spurious Attractors and Text S 1 for details) that should 
also be stable attractors but with smaller basins of attraction. 

A priori, there are several valid hypotheses for the relationship 
between partially reprogrammed cells and natural cell fates. In the 
original experiments [17,18], it was expected that partially 
reprogrammed cells should be a hybrid of the starting and goal 
cell fate only (i.e. have a significant projection only on the starting 
or ending cell fate). Another hypothesis was that in a high- 
dimensional landscape, randomly chosen vectors should be 
orthogonal (Figure S2) (i.e. have a projection o^ a^O with all 
cell fates). However, our model predicts that partially repro- 
grammed cells should be low-dimensional hybrids of existing cell 
fates, but that they do not necessarily have to be a combination of 
the starting and goal cell fate. Mathematically, we predict that 
partially reprogrammed cells should only have a projection 
|a|> 0.106 (2 std above 0, see Figure S2) for a small number of 
natural cell fates. Reanalyzing existing genome-wide datasets on 
partially reprogrammed cells (Table 2) validates the prediction of 
our model that partially reprogrammed cells are low-dimensional 
hybrids of existing cell fates. This qualitative agreement between 
the predicted spurious attractors and the partially reprogrammed 
states is independent of details of our landscape function. 
Importantly, such hybrid states are a generic property of all 
attractor-based landscape models and hence represents an 
important criteria for judging whether attractor-based models 
are suitable for describing epigenetic landscapes. 

Identifying transcription factors for cellular 
reprogramming 

Our landscape model provides a quantitative method to identify 
"predictive" TFs for a given cell fate. These predictive TFs can be 
used as markers of a cell fate and are potential candidates for 
reprogramming protocols. We expect reprogramming TFs to be a 
subset of all predictive TFs but not all predictive TFs will lead to 
successful reprogramming. For example, cell-specific downstream 
targets of reprogramming TFs are likely to also be highly 
predictive for a cell type but may not lead to successful 
reprogramming. 

Most reprogramming experiments follow an experimental 
protocol similar to the one outlined by Takahashi and Yamanaka 
in their seminal paper [1,8]. Initially the starting cells (usually 
mouse embryonic fibroblasts, MEFs) are infected with viruses 
containing all the TFs of interest. The original Yamanaka 
experiment over-expressed 24 TFs [1], while more recent 
experiments usually start with about 10 TFs [2-6]. Several days 
after infection, the cells are switched to culturing conditions that 
support the desired final cell fate. If an experiment is successful, 
cells resembling the desired cell fate will appear after a few weeks. 
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Table 2. Partially reprogrammed cells as spurious attractors. 





Cell line 


Start 


Goal 


Highest projecting states (projection) 


A2 [17] 


MEF 


ESC 


ESC (0.178), MSC (0.158), myoblast (0.142), MEP (0.129), blood vessel (0.113), keratinocyte (0.112), medullary thymic epithelial 
(-0.111), adipose - brown (-0.117), NK (-0.130), CMP (-0.138) 


B3 [17] 


MEF 


ESC 


ESC (0.222), MSC (0.161), blood vessel (0.139), myoblast (0.138), GMP (0.127), kidney (0.111), MEP (0.107), cornea (0.107), 
NK (-0.129) 


BIV1+ [18] 


B Cell 


ESC 


myoblast (0.181), prostate (0.164), MSC (0.154), MEP (0.138), keratinocyte (0.136), cornea (0.125), ESC (0.111), intestine - Paneth 
cell (-0.111), CMP (-0.122) 


BIV1- [18] 


B Cell 


ESC 


ESC (0.382), EpiSC (0.184), MEP (0.160), myoblast (0.145), NSC (-0.108), T Cell (-0.115), skeletal muscle (-0.117), CMP (-0.154) 


MCV6 [18] 


MEF 


ESC 


MEP (0.155), myoblast (0.150), ESC (0.149), keratinocyte (0.145), CLP (0.107), GMP (0.107), cornea (0.107), CMP (-0.130) 


MCV8 [18] 


MEF 


ESC 


ESC (0.203), MEP (0.191), myoblast (0.160), cornea (0.119), prostate (0.113), skeletal muscle (-0.141), CMP (-0.142) 


Partially reprogrammed cell lines (first column) and their significant projections (2 std above noise or a > 0.1 06) onto "natural" cell fates based on microarray data. 
Bold indicates 3 std above noise or \a\>0A59. Abbreviations: CLP, Common Lymphoid Progenitor; CMP, Common Myeloid Progenitor; EpiSC, epiblast stem cell; ESC, 
embryonic stem cell; GMP, Granulocyte-Monocyte Progenitor; MEF, mouse embryonic fibroblast; MEP, Megakaryocyte-Erythroid Progenitor; MSC, Mesenchymal stem 



cells; NK, Natural Killer cells; NSC, neural stem cells. 
doi:1 0.1 371/journal.pcbi.1 003734.t002 



This original list is then pruned to identify a "minimal" (essential) 
set of TFs that still allows for successful reprogramming. In many 
cases, the viruses are excised [39] to confirm that the the 
reprogramming does not depend on viral expression. Further- 
more, recent experiments indicate that the same TFs can be used 
to reprogram to a desired cell fate from multiple initial cell fates 
[16]. These experiments suggest that reprogramming TFs should 
be based on fmal, not initial, cell fate. 

Intuitively, reprogramming candidates should be both highly 
expressed and highly "predictive" of the desired cell fate. The TF 
z-score naturally defines high and low TF expression levels. Within 
our landscape, the "predictivity" rjl of the i^'^ TF for a given cell 
fate /J., is measured by its contribution to the potential energy of 
that cell fate, and is mathematically defined as: 

r,f=±(A-Ti: (5) 

V=l 

where is the cell fate correlation matrix and is the 

expression of TF / in cell fate v. We note that the projection and 
predictivity are directly related as can be seen by 

«"=E'?f5, (6) 

i=l 

where f]1 is the predictivity of TF / in cell fate fi and Si is an 
arbitrary gene expression state. 

For a desired target cell fate, TFs that are high (low) in both 
predictivity and expression in that cell fate are candidates for over 
expression (knock out) in reprogramming (see Figure 3A). For a 
simple, single measure of reprogramming efficacy of a TF, the 
predictivity and expression can be multiplied together to give a 
"reprogramming score", where the top (bottom) rank order TFs 
are the best candidates for over expression (knock out). Figure 3 
shows the expression and predictivity for TFs in a variety of cell 
fates. In Figure 3B, we have explicitly labeled the TFs used in the 
original Yamanaka protocol for reprogramming to ESC. Consis- 
tent with our model, these TFs are both predictive and highly 
expressed. Figure 3C shows TFs that have been successfully used 
in any reprogramming protocol to ESCs [8] as well as the 
pluripotency genes (involved in maintaining stem cell fate) Zfp42 



[Rexl) [40] and NrObl {Daxl) [41]. Once again these genes are 
highly predictive for ESCs. As a further check on the biological 
validity of our predictions, we analyzed the GO Annotation of our 
top 50 candidates for ESC reprogramming (Table SI). Within 
these top TFs, 1 2 have successfully been used in reprogramming, 7 
are known pluripotency TFs, 16 are involving in cell differenti- 
ation, while 15 have no known function and are intriguing 
reprogramming candidates. Taken together this suggest that we 
are capturing the essential biology despite minimal biological data 
for input. 

While ESC have been studied in the most detail, recent 
experiments have reprogrammed (aka direct conversion) to 
other cell fates such as cardiomyocytes [3] (Figure 3D), liver 
[4,5] (Figure 3E), and thyroid [7] (Figure 3F). Once again we 
have explicitly labeled the TFs that have been successfully used 
for direct conversion. Notice that all of these TFs (except Mef2c) 
are highly predictive and highly expressed. Note thdit pi 9 Arf [5] 
used in the direct conversion to liver was not differentially 
expressed in our microarrays and therefore was not included in 
our model. 

We also examined TFs used in direct conversion to neural 
lineages. As discussed in [2], these TFs were chosen because they 
were known to be important in either neurons or neural 
progenitor cells (NPC). Figure 3F and 3G show the expression 
and predictivity of TFs for neural progenitor cells (NPC) [6] 
(Figure 3G), and neurons [2] respectively. Induced NPC were 
made using a four TF cocktail consisting of Pou3f2 {Brn2), Sox2, 
and Foxgl [6] . Our analysis shows that the first two of these TFs 
are predictive for NPCs while Foxgl is predictive for neural stem 
cells (NSC) (see Figure S3). Induced neurons (iN) can be made 
using the TFs Mytll, Pou3f2, and Ascll [2]. Consistent with their 
experimental design, we find that Mytll is highly predictive for 
mature neurons, while the remaining TFs {Pou3f2, Ascll) are 
predictive for NPCs. 

While it is not possible to perform statistical tests to test our 
examples due to the scarcity of reprogramming protocols, we 
performed a simple numerical exercise to gauge the predictive 
power of our model. The four Yamanaka factors are all in the top 
50 when ranked by their reprogramming score for ESCs (where 
the reprogramming score of a TF is defined as the product of the 
expression and predictivity scores of a TF). We randomly 
permuted TF labels and asked how often all four Yamanaka 
factors remained in the top 50. For a million independent 
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Figure 3. Identifying reprogramming candidates. For a given cell fate, we plot every differentially expressed transcription factor's (TF) 
predictivity (aka energy projection-contribution, rjf) vs TF expression level (z-score normalized). Unless otherwise stated all existing reprogramming 
protocols to a given cell fate are labeled. (A) Schematic illustrating predictivity vs expression level plots. The large positive (negative) predictivity and 
large positive (negative) gene expression TFs are candidates for over expression (knock out) in a reprogramming protocol. The TFs with z-score 
between —0.5 and 0.5 are highlighted in gray because Figure 2B suggests these TFs predictivity may be prone to extra noise induced by the data 
discretization. (B) Embryonic stem cell, ESC (induced pluripotent stem cells, iPSC). Original Takahashi and Yamanaka factors PouSfl {Oct 4), Sox2, Klf4, 
and Myc [1]. (C) Inset of ESC positive predictivity and gene expression. Zfp42 {Rexl] [40] and NrObl [Daxl] [41] are pluripotency markers that are not 
necessary to overexpress for reprogramming, while combinations of the remaining labeled TFs have been successfully used in reprogramming 
protocols [8]. (D) Heart (induced cardiomyocytes, iCM) [3]. (E) Liver (induced hepatocytes, iHep). There are two published protocols. One protocol 
used Hnf4a plus any of Foxal, Foxal, or FoxaS [4] while another used Gata4, FoxaS, Hnfla, and deletion of pl9Arf [5]. pi 9Arf was not differentially 
expressed in our microarrays and is not shown. (F) Thyroid [7]. (G) Neural Progenitor Cells, NPC (induced NPC, iNPC) used Pou3f2 {Bm2), Sox2, and 
Foxgl [6]. With our microarrays we find that Foxgl is not predictive for NPC but is predictive of neural stem cells (NSC) (see Figure S3). (H) Neurons 
(induced neuron, iN) [2]. The reprogramming protocol used a combination of factors that were known to be important to ether mature neurons 
{Mytll) or NPCs {Pou3f2, Ascll). (G) shows that Pou3f2 and Ascll are predictive of NPCs. 
doi:1 0.1 371 /journal.pcbi.1 003734.g003 
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permutations, this occurred only once, confirming that our model 
is capturing many essential aspects of cellular reprogramming. 

Our results suggest that epigenetic landscapes may be useful for 
rationally-designing reprogramming protocols to novel cell fates. 
To this end, we have used our model to identify candidate TFs for 
reprogramming, see File S5 for the top 50 candidates for 
overexpression for all cell fates and File S6 for top 50 candidates 
for knockouts for all cell fates. 

Discussion 

A common biological metaphor used to describe development 
and cellular reprogramming is a rugged "epigenetic landscape" 
which emerges from a complex gene regulatory network, with cell 
fates corresponding to attracting valleys in the landscape. Despite 
decades of biological innovation, the large number of genes and 
their complex interactions has prevented the quantitative model- 
ing of a global epigenetic landscape. To meet this challenge, we 
have developed a new quantitative framework of cellular identity 
to directly model the global, high-dimensional epigenetic land- 
scape. Using whole genome expression data, we constructed an 
epigenetic landscape based on techniques from spin glass physics 
and neural networks. Our landscape only depends on the 
experimentally determined gene expression of natural cell fates. 
Yet, it explains the existence of spurious cell fates (known as 
partially-reprogrammed cells) and can reproduce known repro- 
gramming protocols to embryonic stem cells, heart, liver, thyroid, 
neural progenitor cells, and neurons. More importantly, our model 
can be used to identify candidate transcription factors for 
reprogramming to novel cell fates. 

An interesting question is if spurious attractors are ubiquitous 
throughout the landscape, why does standard development not 
produce partially reprogrammed cells? The key is the difference in 
the dynamics. In cellular reprogramming, the starting cell fate is 
forced to express a small number of TF and this leads to a 
stochastic conversion to the desired cell fate (Figure lA). During 
this stochastic exploration of the landscape, there is only a weak 
bias towards the final state, so it is easy for the cells to get trapped 
in a metastable state. However, during standard development, the 
external signals actively reshape the landscape and open up low 
energy valleys between cell fates (Figure IC). This strong bias 
towards the final cell state results in a deterministic switch during 
which the spurious attractors are only a small road bump on the 
path to the final cell state. Therefore, it is not a surprise that 
partially reprogrammed cells are only found during cellular 
reprogramming and not during standard development. 

Epigenetic landscapes can also be used to identify important, or 
predictive, TFs for cell fates. The predictivity of a TF for a cell fate 
generalizes the idea of specificity. A TF is specific to a cell fate if it 
is expressed only on in a small subset of cell fates. In contrast with 
specificity, predictivity weighs the global correlations amongst cell 
fates when assessing the importance of a TF for a cell fate. Thus, 
the predictivity not only picks out important specific TFs, but also 
TFs that are lineage markers. For example, Brachyury (T) [42] is a 
general marker of mesodermal lineages. Since it is highly 
expressed in large a number of cell fates, it is not specific to any 
given cell fate. However, it is predictive because its expression is a 
strong indicator that a given cell fate is a mesodermal lineage. 

The concept of predictivity also yields new insights into the 
Yamanaka protocol. When the Yamanaka factors were first 
published, two of the four TFs, Pou5fl {Oct4) and Sox2 were 
known to be important for ESCs. In contrast, the role of the other 
two TFs, Klf4 and Myc, was not well understood [43]. It was 
quickly shown that Myc was was not essential to reprogramming 



{Oct4, Sox2, and Klf4 can reprogram alone), but nonetheless 
enhanced the efficacy of reprogramming [44] . The importance of 
Klf4 was surprising given that it is neither highly expressed nor 
specific for ESC. However, Klf4 is highly predictive of ESC (Table 
S2). For this reason, our model actually explains why Klf4 is a 
prime candidate for reprogramming to ESCs. 

We make several experimentally verifiable predictions. First, 
our model predicts the partially reprogrammed cells should be 
hybrids of existing natural cell fates. As more partially repro- 
grammed cells are studied, if they are found to either have high 
projection on only one cell fate {a^ ~ 1 for one fi) or no projections 
on any cell fates [a^^O for all //), this would call into question 
whether partially reprogrammed cells are truly the spurious 
attractors of an attractor neural network. Second, our model can 
be used to identify important, or predictive, TFs for cell fates. TFs 
with large positive (negative) predictivity should be positive 
(negative) markers for a cell fate. Additionally, for cellular 
reprogramming we predict that TFs with large positive (negative) 
predictivity and expression could be over expressed (knocked out) 
to reprogram to a desired cell fate. Therefore, our model has 
several predictions that can be tested against future experimental 
progress in the field. 

Our model has several limitations. First, a generic limitation for 
any method relying on microarrays to define gene expression is 
that one cannot distinguish between direct, causal, interactions 
and indirect, correlative, interactions. Therefore, predictivity can 
establish the importance of a gene, but further experiments are 
needed to determine if the predictive gene is the controller of the 
cell type or just a passive indicator of a cell type. Second, it fails to 
accurately capture the dynamics of reprogramming. Simulations 
of reprogramming with known protocols, such as the Yamanaka 
protocol, lead to rates of reprogramming that are comparable to 
the rates from a reprogramming simulation with a randomly 
selected protocol. This is likely due to the fact that cell fates are 
extremely stable and hence reprogramming is extremely rare. 
Third, our model does not directly explain the importance of the 
non-specific transcription factor Myc. Many protocols use Myc [8], 
but it can be replaced (with no deleterious effect) by short hairpin 
RNAs (shRNAs) [45] , or dropped completely from protocols at the 
expense of speed and less efficient reprogramming [44]. This 
suggests that Myc may have an alternative role and instead of 
being a biasing field, Bf, it may instead raise the effective noise of 
the system (i.e. decrease p). Another limitation is that based on the 
currently available experimental data, our landscape construction 
cannot definitively be distinguished from alternative constructions. 
For example, the interaction network could be constructed by such 
that it does not weigh each cell fate equally (as is currently done). 
This would have the effect of changing the relative stability of cell 
fates. Therefore, in the absence of more experimental data, our 
landscape and a weighted landscape cannot be distinguished. 

A popular approach to inferring landscapes from biology data 
are "Maximum Entropy" models. This method has been used to 
model firing neurons [46], protein configurations [47,48], and 
antibody diversity [49] . The Maximum Entropy approach takes as 
input large samples of biological data and a set of constraints and 
outputs a landscape that maximizes the entropy. While Maximum 
Entropy models can be used to infer landscapes with basins of 
attraction [50], it can quickly become a computationally 
challenging problem. Our approach differs from Maximum 
Entropy models in the following way. Since our goal is to model 
a landscape with basins of attractions, we make the ansatz that the 
landscape can be described by a Hopfield neural network. Then 
we insert real biological data, ^, to construct the landscape exactly. 
Our method requires no computational inference of parameters. 
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There are several natural extensions of the model discussed in 
this paper. The landscape could be constructed with additional 
biological input such as other genes, microRNAs, or histone 
modification data. This opens up possibilities of improving upon 
the high reprogramming rates achieved by overexpressing 
microRNAs [51] or synthetic mRNAs [52]. Another attractive 
element of the framework presented here is that it allows for a 
quantitative analysis of whole genome-wide expression states (see 
Table 2). This is likely to yield a more accurate classification of 
reprogrammed cells. Finally, directed differentiation protocols [53] 
attempt to mimic standard development in vitro and have proven 
to have high efficiency and fidelity. Future work will try to use our 
landscape to predict the necessary signaling factors for rationally 
designing more efficient directed differentiation protocols. Overall, 
epigenetic landscapes provide a unifying framework for cell 
identity, reprogramming, and directed differentiation, and our 
results suggest these landscapes can provide crucial insight into the 
molecular circuitry and dynamics that gives rise to cell fate. 

Materials and Methods 

Data analysis 

Here we present the details of the dataset. All data used in this 
paper are available in the online Supplementary Information and 
is organized as follows: 

• File S 1 : Microarray Sources. List of all microarrays used in this 
paper. 

• File S2: TF Z-Score. The z-score gene expression for each TF 
of natural cell fates in this paper. This data is post RMA 
normalization and averaging over multiple replicates for each 
natural cell fate. 

• File S3: TF Predictivity. The predictivity for each TF and cell 
type in this paper. 

• File S4: Partially Reprogrammed Cells Z-Score. The z-score 
gene expression for each TF of partially reprogrammed cell 
fate. This data is post RMA normalization and averaging over 
multiple replicates for each partially reprogrammed cell fate. 

• File S5: Overexpression Candidates. Top overexpression 
candidates to reprogram to various cell fates. 

• File S6: Knock-Out Candidates. Top knock-out candidates to 
reprogram to various cell fates. 

An older version of this manuscript, Arxiv v3 [54], has 
additional microarrays available that are unused in this version 
of the text. All microarrays used in this paper were taken from the 
public databases ArrayExpress (www.ebi.ac.uk/arrayexpress) or 
GEO (www.ncbi.nlm.nih.gov/geo). See File SI: Microarray 
Sources for details on where to obtain raw, pre-normalized and 
pre-averaged data. 

There are two datasets, the natural cell fates and the partially 
reprogrammed cells. For the natural cell fates, we only used the 
Affymetrix GeneChip Mouse Gene 1.0 ST platform due to the 
large number of available microarrays on ArrayExpress (www.ebi. 
ac.uk/arrayexpress) and the better technical design of the platform 
(1.0 ST has probe matches throughout a gene in contrast to just 
the 3' UTR in Affymetrix GeneChip Mouse Genome 430 2.0). 
There is limited data on partially reprogrammed cells so we used 
microarrays from Affymetrix GeneChip Mouse Genome 430 2.0. 

The raw microarray data was converted to an expression level 
as follows. Microarray probe-to-gene map was created with 
Bioconductor 2.10. All raw microarray files were initially 
processed by robust mean averaging (RMA) in MATLAB, and 
genes with multiple microarray probes were averaged. We did 



additional processing of this output for two reasons. First, we need 
to compare microarrays from multiple platforms, but the standard 
RMA output can vary significantly from platform to platform. 
Second, since gene expression is a set of positive definite numbers, 
the minimal assumption model of gene expression is a log-normal 
distribution. Therefore, to make robust comparisons across 
platforms, we used order statistics [55]. The RMA output was 
converted to a rank order. Next, we want to convert this rank 
order to the z-score of a log-normal distribution. We converting 
the rank to a percentile (for N genes, divide by 7V+ 1), and then 
this percentile into a normal z-score. For later mathematical 
convenience, we used a biased estimator (normalize by N not 
TV— 1) since then the Euclidean norm of each microarray gene 
expression is N. 

At this point, the natural dataset consisted of 60 1 microarrays 
with 20719 genes. Since we were interested in cellular identity, 
only transcription factors, transcription factor co-factors, or 
chromatin remodeling genes were kept (for short hand, referred 
to as transcription factors (TF) throughout the text) [56], leaving 
1715 TFs. 

As explained in the main text, since continuous (sigmoidal input) 
attractor neural networks and discrete attractor neural networks 
are known to have the same stable fixed points [57], we used the 
binarized gene expression. We binarized the gene expression by 
setting a positive z-score to + 1 and a negative z-score to — 1 . 
While this was mainly done for mathematical convenience, this is 
potentially biologically justified. Histone modifications (HM) either 
leave chromatin in an open, accessible configuration or a closed, 
inaccessible state [35]. We found global HM data for embryonic 
stem cells (ESC), mouse embryonic fibroblasts (MEF), and neural 
progenitor cells (NPC) [36,37]. Consequently, we used the global 
HM data for these three cell fates and compared them to 
microarray TF expression levels. This allowed us to create a 
conditional probability distribution of each HM for a given TF 
expression level (Figure 2B). We found a sharp cutoff (that 
coincides with a z-score of 0) which distinguished TFs with the 
activating modification of histone 3 tri-methylation at lysine 4 (K4) 
from TFs with the inactivating modification of histone 3 tri- 
methylation at lysine 27 (K27), poised/bivalent TFs (both K4 and 
K27), and no HM (most likely DNA methylation). This shows that 
our mathematical assumption is justified by the HM data. 

After the binarization of TF expression, all TFs that were not 
differentially expressed across cell fates (i.e. TFs that are always 
on/ always off in every cell fate) were dropped, leaving 1337 TFs. 
The binarized TF expression for the 63 cell fates was found by first 
binarizing all 601 microarrays and then taking the majority vote 
for each cell state (with ties broken by averaging the continuous 
data). The final result was the binary expression state for 63 cell 
fates. 

Microarrays for partially reprogrammed cells were on the 
Affymetrix GeneChip Mouse Genome 430 2.0 Array. The same 
procedure was used to convert raw microarray data to z-score 
expression. However, since different microarrays do not have the 
same genome coverage, the analysis comparing partially repro- 
grammed cells and natural cell fates used the TV =1329 TFs 
common to both platforms. 

Several self-consistency checks were performed on the data. 
First, the correlation matrix A^^ (explained in main text and 
below) was calculated for the original continuous data and for the 
binarized data (Figure SI). Both correlation matrices are consistent 
with each other showing binarization does not change the global 
correlations. Note that in the correlation matrix, cell fates have 
been grouped by tissue type, leading to a block diagonal form. 
Second, the expression state of all cell fates was constructed from 
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multiple micro array experiments. These different experiments 
were compared with each other and were within 2 standard 
deviations (std equal to l/v^~ 0.027) for all cell fates. This 
demonstrates that microarrays from multiple laboratories can be 
directly compared. 

Landscape model 

Here we give an overview of our epigenetic landscape model. 
The model is summarized in Table 1, and Text SI provides a 
supplementary overview of attractor neural networks. 

State space. Each TF (labeled by /, j) can be in a state 
Si = +\ where + 1 indicates the TF is active while — 1 indicates it 
is inactive. A general cell state is given by S, an A^=1337 
dimensional vector. There are p = 63 cell fates (labeled by fi, v). In 
cell type jn, the state of TF / is given by . The complete cell type 
data is a /? by TV matrix determined using our microarray data 
described above and these ^ are the only biological input into the 
landscape. 

Full landscape. The complete landscape H can be written as 
the following terms: 

^ ~ Hjjasin ~t" Hjjias + H culture "I" H switch (7) 

Our landscape assigns an "energy" to every global expression 
state. We emphasize that this energy does not correspond to 
physical energy consumption of ATP; instead it is an abstract 
energy that corresponds to stability and developmental potential of 
cell fates. Each of the four terms has a simple interpretation (see 
Figure 1). The first term, Hhasin, ensures that observed cell fates are 
valleys in our landscape (Figure lA). The second term, Hhias, 
describes biasing of specific TFs by experimentalists (not shown in 
Figure 1). The third term, H culture, increases the radius and depth 
of cell fates that are favored by the environment or culturing 
conditions (Figure IB). Finally, in the presence of an external 
signal that gives rise to differentiation (ex. growth factors 
associated with differentiation), the fourth term, H switch, opens a 
low energy path between the initial and fmal cell fates (Figure IC). 

Landscape details: Hj,asiif The gene expression profiles of 
naturally occurring cell fates must be minima of our landscape. 
This is ensured by the landscape term 

^ N N 

Hbasin = — 7^ ^ ^ ^i-^ij^j (^) 

In order to guarantee that cell fates are basins of attraction, we 
need to choose the "effective interaction" matrix, Jfj, which 
encodes how the yth TF influences the ith TF. Since we have 
highly correlated cell fates, we use the projection-method [32] (see 
Text SI section "Discrete, Projection Method" for extended 
discussion on this choice), which defines the interaction matrix as: 

^=1 v=l 

where are the natural cell fates and is the inverse of the 
correlation matrix between cell fates. Since our construction is 
based on correlations between gene expression profiles, Jfj 
includes the effect of "indirect" interactions between TFs / and j 
that are mediated through other TFs (see Text SI for additional 
mathematical explanation of this construction). While the current 



definition implies Jij is symmetric, this can easily be generalized to 
an asymmetric Jij (see later section Landscape vs Pseudo- 
Landscape for details). 

Landscape details: Hbias* The term Hbasin ensures that all 
cell fates are global minima of the landscape. However, additional 
terms in the landscape are needed in order to incorporate key 
experimental features. 

First, biologists can directly manipulate gene expression. For 
example, during the Yamanaka experiment, the TFs 
Pou5f\ (Oct4), Sox2, K\f4, and Myc are overexpressed in 
fibroblasts. Mathematically, we represent the overexpression of TF 
/ by a local biasing field Bf that ensures that = 1 . Therefore the 
Yamanaka reprogramming protocol enters the landscape as: 

N 

Hbias=-Y.BtSi (10) 

i=l 

where for the Yamanaka protocol, BpguSfi =Bsox2 =BicifA =BMyc 

00 and for any other TF /, the field 5/ = 0. 

Landscape details: H culture- Currently, the basins of attrac- 
tion Hbasin are all set to the same minima value. However, 
environmental signals (such as cell culture conditions) can stabilize 
and destabilize specific cell fates (see Figure IB). This can be 
incorporated into our landscape by terms such as: 

Heulture=-Nj2b'a^ (H) 

^=1 



= -^q5, (12) 

i^i 

where represents the culture biasing, and is the projection 
onto cell fate jii. This bias can be equivalently expressed at the 

level of TFs by defining a culture bias, C/, for the ith TF given 
by: 

Q=j2j2i^i^-Tif (13) 

For example during the Yamanaka protocol, cells are cultured 
in conditions favorable to ESC, which is mathematically 
represented by b^^^ > 0, while for all other cell fates fi, b^ = 0. 

Landscape details: H switch- During standard development, 
cells switch fates deterministically in response to external signals. 
We mathematically represent this using the term: 



Hswitch = - y 5] J^m^G^^a^ (14) 



1 N N 

where is the overlap on cell fate /i, is the projection onto cell 
fate V, and the matrix G^^ is the developmental signal matrix that 
is a dynamic entity and a function of developmental time and 
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external signals. We can equivalently write this in terms of 
transcription factors using the gene-interaction matrix, Kij, defined 
as: 

^^=-^EEE«ff^"(^"'n; (16) 

li=\ v = l p = l 

where ^ is the natural cell fate states, G^^ is the developmental 
signal matrix, and h the inverse correlation matrix. Since G^^ 
is asymmetric, Kij is also asymmetric and explicitly breaks detailed 
balance (see later section Landscape vs Pseudo-Landscape for 
details). 

We now explain the development signal matrix in more details. 
If G^^ > 0, this opens up a low energy path between cell fate v and 
cell fate ja. For example, during blood development, the common 
myeloid progenitor (CMP) can differentiate into either granulo- 
monocytic progenitors (GMP) or megakaryocyte-erythroid pro- 
genitors (MEP). The complicated external signals that induce 
switching from a CMP to GMP leads to G^^^'^^^>0 and all 
other G^^ = 0, while the signals that induce switching from a CMP 
to MEP leads to G^^^'^^^>0 and all other G^' = 0. We 
emphasize that this term is purely phenomenological and further 
research will be needed to directly connect the developmental 
biology signals (such as TGFp, JVNT, etc) to the matrix elements 
G^\ 

Dynamics. We have uniquely defined the landscape H. 
However, there are multiple ways to implement dynamics on this 
landscape. In this paper, we are primarily interested in the 
behavior of the stable fixed points and not dynamical trajectories. 
Therefore, we follow the standard convention in the attractor 
neural network literature and update the network by random, 
asynchronous updates (Glauber dynamics) [33]. Therefore, at 
each update, a random TF, /, is selected and updated according to 
the probability 

Mt)Si(t+l) 

where Sj is the expression state of the ith TF, P is an effective noise 
parameter, hi is the local field, and / is the time index. The local 
field hi is the gradient of the landscape (covariant derivative) 
defined for the full landscape H as: 

N N 

hi= Y^j^jSj+Bi+Ci+ Y^^y^j (^^) 

where /// is the basin-inducing interaction matrix, Bi is the 
experimentally induced bias on the /th TF, C/ is the culturing- 
condition specific bias on the /th TF, and is the developmental 
interaction matrix. 

We have introduced the effective noise parameter P=l/T (i.e. 
inverse temperature) that controls the level of stochasticity 
resulting from biochemical noise. When p^co, the update 
approaches a deterministic step function, while when each 
state is equally likely. Based on the currently available static 
genomic data, this update time cannot be directly related to 
biological time. Finally, we emphasize that since in this paper we 
are primarily concerned with the structure of the landscape, our 
results are independent of our choice of dynamics (see Text S 1 for 
detailed discussion on dynamics). 

Landscape vs pseudo-landscape. Currently, the interac- 
tions between TFs, are symmetric. In real biology, this is 



unlikely to be true. We can introduce asymmetry into the 
interactions by randomly deleting interactions (for example 
Figure 2E Diluted). This asymmetry means that influence of TF 
/ on TF j no longer equals the influence of TF j on TF /. This 
asymmetry breaks detailed balance and implies a non-Lyapunov 
pseudo-potential [29,33,58] and has been shown to be an 
additional source of noise on the basins of attraction [33]. 

We also note that the landscape term Hswitch is explicitly non- 
equilibrium and breaks detailed balance. Under one set of 
environmental conditions, G^^ > 0 while G^^ — 0 driving switching 
from v^iJ., while under a different set of environmental 
conditions, G^^>0 while G^^ =0 driving switching from fi^v. 
Therefore, by including H switch we are actually making our 
landscape a pseudo-landscape (i.e. non-Lypanouv). 

Simulations 

Here we include details of the simulations in this paper. For all 
simulations, we set ^= 1/0.45?^ 2.2 and evolved the system for 
100,000 TF updates. 

In Figure 2E, we demonstrate that we have basins of 
attraction. The initial conditions were created by taking the 
ESC expression vector and randomly flipping 15% of the TFs. 
After every 5000 updates of asynchronous dynamics, burst 
errors were introduced by randomly flipping 2% of TFs. For 
the asymmetric dilution, the standard interaction matrix Jij 
was created. Then 20% of matrix entries were randomly set to 
0. 

In Figure 2F, we demonstrate that the landscape can 
deterministically switch between basins. The initial conditions 
were always the CMP expression vector. For signal 1, we set 
GGMP,cmp^q^ and all other G^' =0. For signal 2, we set 
qMEP,cmp^q ^ and all other G^^ = 0. 

Spurious attractors 

Here we provide more details on spurious attractors and hybrid 
cell fates. As explained in more detail in Text SI, for the 
traditional Hopfield model, these spurious attractors take the form 
of odd-majority vote mixtures [33] (i.e. majority vote at each TF of 
3,5,7, . . . of the (Jf). The projection method also has the additional 
spurious attractors of any linear combination of that spans the 
discrete state space (see geometric interpretation given in Text SI) 
[32]. For convenience, we use the word hybrid as the collective 
term for either majority vote mixtures or linear combinations of 
existing cell fates. 

As discussed in the main text, the prediction of spurious 
attractors in the projection method inspired us to reexamine data 
on existing partially reprogrammed cells. Surprisingly, we found 
that partially reprogrammed cells could be thought of as hybrids of 
existing cell fates. However, we are currently only able to obtain 
qualitative agreement between partially reprogrammed cells and 
the predicted nature of the spurious attractors. While it is known 
that the projection method retains these odd-majority vote 
mixtures spurious attractors, the correlations between states 
implies these spurious attractors may no longer be symmetric 
mixtures. However, the exact nature of these spurious attractors is 
not known and will be explored in future work. 

Supporting Information 

Figure SI Cell fate correlation matrices. (A) Correlation 
matrix between cell fates for continuous data. (B) Correlation 
matrix for binarized data. 
(PDF) 
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Figure S2 Projection of a random vector on a given cell 
fate. Ten thousand binarized random vectors were created in 
MATLAB and projected onto the cellular sub-space. The 

histogram shows the distribution of the projections. The red line 
is a Gaussian fit to the histogram. The mean is practically zero 
while the standard deviation is 0.053. 
(PDF) 

Figure S3 Predictivity vs expression for NSC. Same type 
of plot as Figure 3. Labeled TFs are part of reprogramming 
protocol to NPC [6]. This illustrates that Foxgl is predictive for 
NSC, even though it is not for NPC. 
(PDF) 

File SI Microarray sources. List of all microarrays used in 

this paper. 

(TXT) 

File S2 TF Z-score. The z-score gene expression for each TF of 
natural cell fates in this paper. This data is post RMA 
normalization and averaging over multiple replicates for each 
natural cell fate. 
(TXT) 

File S3 TF predictivity. The predictivity for each TF and cell 
fate in this paper. 

(TXT) 

File S4 Partially reprogrammed cells Z-score. Partially 
Reprogrammed Cells Z-Score. The z-score gene expression for 
each TF of partially reprogrammed cell fate. This data is post 
RMA normalization and averaging over multiple replicates for 
each partially reprogrammed cell fate. 
(TXT) 

File S5 Over expression candidates. Top overexpression 

candidates to reprogram to various cell fates. 

(TXT) 
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